Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-15233: [C++] Fix CopyFiles when destination is a FileSystem with background_writes #44897

Merged

Conversation

Tom-Newton
Copy link
Contributor

@Tom-Newton Tom-Newton commented Dec 2, 2024

Rationale for this change

Currently it deadlocks when trying to upload more files than we have IO threads.

What changes are included in this PR?

  1. Use CloseAsync() inside copy_one_file so that it returns futures, instead of blocking while it waits for other tasks to complete. This avoids the deadlock.
  2. Fix a segfault in FileInterface::CloseAsync() by using shared_from_this() to so that it doesn't get destructed prematurely. shared_from_this() is already used in the CloseAsync() implementation on several filesystems. After change 1 this is important when downloading to the local filesystem using CopyFiles.
  3. Spawn copy_one_file less urgently than default, so that background_writes are done with higher priority. Otherwise copy_one_file will keep buffering more data in memory without giving the background_writes any chance to upload the data and drop it from memory. Therefore, without this large copies would cause OOMs.
    1. The current Arrow thread pool implementation makes provision for spawning with a set priority but the priority numbers are just ignored.
    2. I made a small change to use a std::priority_queue to implement the priority.
    3. There is a property of the current implementation that tasks will be scheduled in the order of calls to Spawn. There are also a few test cases in arrow-acero-hash-aggregate-test that fail if this property is not maintained. I'm not sure what is the correct option but for now I have ensured that tasks of the same priority are run in the order they are spawned.
    4. I think this has got to be the most contentious part of the change, with the broadest possible impact if I introduce a bug. I would appreciate some input on whether this seems like a reasonable change and I'm very happy to move it to a separate PR and/or discuss further.

Are these changes tested?

Added some extra automated tests. I also ran some large scale tests copying between Azure and local with a directory of about 50,000 files of varying sizes totalling 160GiB.

Are there any user-facing changes?

  1. CopyFiles should now work reliably
  2. ThreadPool now runs tasks in priority order

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 4, 2024
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good on the principle.

By the way, this seems based on a rather old version of git main, can you update?

cpp/src/arrow/util/thread_pool.cc Outdated Show resolved Hide resolved
cpp/src/arrow/util/thread_pool.cc Show resolved Hide resolved
cpp/src/arrow/util/thread_pool_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/util/thread_pool_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/io/interfaces.cc Show resolved Hide resolved
cpp/src/arrow/filesystem/test_util.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/test_util.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/test_util.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/filesystem.cc Show resolved Hide resolved
@Tom-Newton Tom-Newton force-pushed the tomnewton/fix_pyarrow_copy_files_uploads/GH-15233 branch from 1294e38 to dbefb7e Compare December 4, 2024 14:49
Copy link
Contributor Author

@Tom-Newton Tom-Newton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rebased on the latest main and I think I've addressed all the comments

cpp/src/arrow/util/thread_pool.cc Show resolved Hide resolved
cpp/src/arrow/filesystem/test_util.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/test_util.cc Outdated Show resolved Hide resolved
@Tom-Newton Tom-Newton marked this pull request as ready for review December 4, 2024 14:54
@Tom-Newton Tom-Newton requested a review from pitrou December 9, 2024 13:18
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! Here are some more comments.

cpp/src/arrow/filesystem/test_util.cc Outdated Show resolved Hide resolved
cpp/src/arrow/util/thread_pool.cc Outdated Show resolved Hide resolved
cpp/src/arrow/util/thread_pool_test.cc Outdated Show resolved Hide resolved
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, let's wait for CI

@pitrou
Copy link
Member

pitrou commented Dec 10, 2024

@github-actions crossbow submit -g cpp

This comment was marked as outdated.

@pitrou
Copy link
Member

pitrou commented Dec 10, 2024

Hmm, it looks like we'll either have to fix the serial executor to also take priorities into account (if that's at all possible?), or simply disable the priority test when threading is disabled.

@Tom-Newton
Copy link
Contributor Author

Hmm, it looks like we'll either have to fix the serial executor to also take priorities into account (if that's at all possible?), or simply disable the priority test when threading is disabled.

I think we can implement priorities exactly the same on SerialExecutor as on ThreadPool, so I pushed a commit to do that. I feel like this should be covered by the normal CI though, without needing to build with ARROW_ENABLE_THREADING=OFF, so I'm going to see how easy it is to parameterise the new tests I added.

@Tom-Newton
Copy link
Contributor Author

Tom-Newton commented Dec 11, 2024

I feel like this should be covered by the normal CI though, without needing to build with ARROW_ENABLE_THREADING=OFF, so I'm going to see how easy it is to parameterise the new tests I added.

I think it would at least require making the constructor of SerialExecutor public, so maybe its not really desirable to add such a test?

@Tom-Newton Tom-Newton force-pushed the tomnewton/fix_pyarrow_copy_files_uploads/GH-15233 branch from 697c747 to 7573d11 Compare December 12, 2024 22:28
@Tom-Newton
Copy link
Contributor Author

Tom-Newton commented Dec 13, 2024

@pitrou do you have any opinion on extra tests for priorities in SerialExecutor? If you are happy please can you trigger the crossbow tests again.

@Tom-Newton Tom-Newton force-pushed the tomnewton/fix_pyarrow_copy_files_uploads/GH-15233 branch from 93c00e9 to 7573d11 Compare December 13, 2024 16:31
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, just a nit

cpp/src/arrow/util/thread_pool.cc Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Dec 16, 2024

@github-actions crossbow submit -g cpp

This comment was marked as outdated.

@Tom-Newton
Copy link
Contributor Author

Sorry, it looks like there are some bits that I missed when converting SerialExecutor to priority queue because I never tested it locally with ARROW_ENABLE_THREADING=OFF

@Tom-Newton
Copy link
Contributor Author

Ok, this time I managed to test it locally ARROW_ENABLE_THREADING=OFF so hopefully I haven't missed anything else.

@pitrou
Copy link
Member

pitrou commented Dec 16, 2024

@github-actions crossbow submit -g cpp

Copy link

Revision: 5774bf2

Submitted crossbow builds: ursacomputing/crossbow @ actions-372e4c30db

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-20.04-cuda-11.2.2 GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@Tom-Newton
Copy link
Contributor Author

Thanks for triggering the extra tests again. That run is looking better but it looks like a couple failed because of a timeout downloading a dependency

    error: downloading 'https://archive.apache.org/dist/orc/orc-format-1.0.0/orc-format-1.0.0.tar.gz' failed
          status_code: 28
          status_string: "Timeout was reached"
          log:
          --- LOG BEGIN ---
            Trying 65.108.204.189:443...
    Trying 2a01:4f9:1a:a084::2:443...

Presumably this is not related to me changes.

@pitrou
Copy link
Member

pitrou commented Dec 16, 2024

Yes, I've restarted those CI builds.

@pitrou
Copy link
Member

pitrou commented Dec 16, 2024

Ok, thanks a lot for doing this @Tom-Newton !

@pitrou pitrou changed the title GH-15233: Fix CopyFiles when destination is a FileSystem with background_writes GH-15233: [C++] Fix CopyFiles when destination is a FileSystem with background_writes Dec 16, 2024
@pitrou pitrou merged commit 894f5fb into apache:main Dec 16, 2024
38 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Dec 16, 2024
@Tom-Newton
Copy link
Contributor Author

Ok, thanks a lot for doing this @Tom-Newton !

Thanks for reviewing

Copy link

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 894f5fb.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants